Abstract

Dialogue state tracking (DST) is the process of estimating the distribution of the dialogue states as a
dialogue progresses. Recent studies on the constrained Markov Bayesian polynomial (CMBP) framework
take the first step towards bridging the gap between rule-based and statistical approaches for
DST. In this paper, the gap is further bridged by a novel hybrid framework, the recurrent polynomial
network (RPN). RPN’s unique structure enables the framework to have all the advantages of CMBP,
including efficiency, portability and interpretability. Additionally, RPN achieves more properties
of statistical approaches than CMBP. RPN was evaluated on the data corpora of the second and
the third Dialog State Tracking Challenge (DSTC-2/3). Experiments showed that RPN can
significantly outperform both traditional rule-based approaches and statistical approaches with a similar
feature set. Compared with the state-of-the-art statistical DST approaches with much richer features,
RPN is also competitive.

Keywords: Statistical Dialogue Management, Dialogue State Tracking, Recurrent Polynomial 
Network 


1. Introduction 

A task-oriented spoken dialogue system (SDS) is a system that can interact with a user to accomplish
a predefined task through speech. It usually has three modules: input, output and control, shown in
Figure 1. The input module consists of automatic speech recognition (ASR) and spoken language
understanding (SLU), with which the user speech is converted into text and semantics-level user
dialogue acts are extracted. Once the user dialogue acts are received, the control module, also called
dialogue management, accomplishes two missions. One mission is called dialogue state tracking
(DST), which is a process to estimate the distribution of the dialogue states, an encoding of the
machine’s understanding about the conversation, as a dialogue progresses. Another mission is to
choose semantics-level machine dialogue acts to direct the dialogue given the information of the
dialogue state, referred to as dialogue decision making. The output module converts the machine
acts into text via natural language generation and generates speech according to the text via
text-to-speech synthesis.

Dialogue management is the core of an SDS. Traditionally, dialogue states are assumed to be
observable and hand-crafted rules are employed for dialogue management in most commercial SDSs.
However, because of unpredictable user behaviour and inevitable ASR and SLU errors, dialogue state
tracking and decision making are difficult (Williams and Young, 2007). Consequently, in recent
years, there has been a research trend from rule-based dialogue management towards statistical dialogue
management. The partially observable Markov decision process (POMDP) framework offers a
well-founded theory for both dialogue state tracking and decision making in statistical dialogue
management (Roy et al., 2000; Zhang et al., 2001; Williams and Young, 2005, 2007; Thomson and Young,
2010; Gasic and Young, 2011; Young et al., 2010). In previous studies of POMDP, dialogue state
tracking and decision making are usually investigated together. In recent years, to advance the
research of statistical dialogue management, the DST problem has been separated from the statistical
dialogue management framework so that a variety of models can be investigated for DST.





Figure 1: Diagram of a spoken dialogue system (SDS) 


Most early studies of POMDP on DST were devoted to generative models (Young et al., 2010),
which learn the joint probability distribution over observations and labels. Fundamental weaknesses of
generative models were revealed by the results of Williams (2012). In contrast, discriminative state
tracking models have been successfully used for SDSs (Deng et al., 2013). Compared to generative
models, where assumptions about probabilistic dependencies of features are usually needed,
discriminative models directly model the probability distribution of labels given observations, enabling rich
features to be incorporated. The results of the Dialog State Tracking Challenge (DSTC) (Williams
et al., 2013; Henderson et al., 2014b,a) further demonstrated the power of discriminative
statistical models, such as Maximum Entropy (MaxEnt) (Lee and Eskenazi, 2013), Conditional Random
Field (Lee, 2013), Deep Neural Network (DNN) (Sun et al., 2014a), and Recurrent Neural Network
(RNN) (Henderson et al., 2014d). In addition to discriminative statistical models, discriminative
rule-based models have also been investigated for DST due to their efficiency, portability and
interpretability, and some of them showed good performance and generalisation ability in DSTC (Zilka
et al., 2013; Wang and Lemon, 2013). However, both rule-based and statistical approaches have
some disadvantages. Statistical approaches have shown large variation in performance and poor
generalisation ability due to the lack of data (Williams, 2012). Moreover, statistical models
usually have more complex model structures and features than rule-based models, and thus can hardly
achieve the efficiency, portability and interpretability of rule-based models. As for rule-based models,
their performance is usually not competitive with the best statistical approaches. Additionally, since
they require a lot of expert knowledge and there is no general way to design rule-based models with
prior knowledge, they are typically difficult to design and maintain. Furthermore, they lack a way
to improve their performance when training data are available.

Recent studies on the constrained Markov Bayesian polynomial (CMBP) framework take the first
step towards bridging the gap between rule-based and statistical approaches for DST (Sun et al.,
2014b; Yu et al., 2015). CMBP formulates rule-based DST in a general way and allows data-driven
rules to be generated. Concretely, in the CMBP framework, DST models are defined as polynomial
functions of a set of features whose coefficients are integers and satisfy a set of constraints in which
prior knowledge is encoded. The optimal DST model is selected by evaluating each model on
training data. Yu et al. (2015) further extended CMBP to real-coefficient polynomials, where the real
coefficients can be estimated by optimizing the DST performance on training data using grid search.
CMBP offers a way to improve the performance when training data are available and achieves
performance competitive with the state-of-the-art statistical approaches, while at the same time keeping
most of the advantages of rule-based models. Nevertheless, adding features to CMBP is not as easy
as in most statistical approaches because, on the one hand, the features usually need to be
probability-related features and, on the other hand, additional prior knowledge is needed to constrain the search
space. For the same reason, increasing the model complexity, for example by using higher-order
polynomials or by introducing hidden variables, also requires additional suitable prior knowledge to
keep the search space from becoming too large. Moreover, CMBP can hardly fully utilize the
labelled data because in practice its polynomial coefficients are set by grid search.

In this paper, a novel hybrid framework, referred to as recurrent polynomial network (RPN), is
proposed to further bridge the gap between rule-based and statistical approaches for DST. Although
the basic idea of transforming rules to neural networks has been around for many years (Cloete and
Zurada, 2000), little work has been done for dialogue state tracking. RPN can be regarded as a kind
of human-interpretable computational network, and its unique structure enables the framework to
have all the advantages of CMBP, including efficiency, portability and interpretability. Additionally,
RPN achieves more properties of statistical approaches than CMBP. In general, RPN has neither
a restriction on feature type nor a search space issue to be concerned about, so adding features and
increasing the model complexity are much easier in RPN. Furthermore, RPN can better explore the
parameter space than CMBP with labelled data.

The DSTCs have provided the first common testbed in a standard format, along with a suite of
evaluation metrics to facilitate direct comparisons among DST models (Williams et al., 2013). To
evaluate the effectiveness of RPN for DST, both the dataset from the second Dialog State Tracking
Challenge (DSTC-2), which is in the restaurants domain (Henderson et al., 2014b), and the dataset from
the third Dialog State Tracking Challenge (DSTC-3), which is in the tourists domain (Henderson et al.,
2014a), are used. For both datasets, the dialogue state tracker receives SLU N-best hypotheses
for each user turn, each hypothesis having a set of act-slot-value tuples with a confidence score. The
dialogue state tracker is supposed to output a set of distributions of the dialogue state. In this paper,
only joint goal tracking, which is the most difficult and general task of DSTC-2/3, is of interest.

The rest of the paper is organized as follows. Section 2 discusses ways of bridging rule-based
and statistical approaches. Section 3 formulates RPN. The RPN framework for DST is described in
section 4, followed by experiments in section 5. Finally, section 6 concludes the paper.


2. Bridge Rule-based and Statistical Approaches 

Broadly, there are two straightforward ways to bridge rule-based and statistical
approaches: one starts from rule-based models, while the other starts from statistical models.
CMBP takes the first way and is derived as an extension of a rule-based model (Sun et al., 2014b;
Yu et al., 2015). Inspired by the observation that many rule-based models, such as the models proposed
by Wang and Lemon (2013) and Zilka et al. (2013), are based on Bayes’ theorem, in the CMBP
framework a DST rule is defined as a polynomial function of a set of probabilities, since Bayes’
theorem is essentially summation and multiplication of probabilities. Here, the polynomial
coefficients can be seen as parameters. To make the model have good DST performance, prior knowledge
or intuition is encoded into the polynomial functions by setting certain constraints on the polynomial
coefficients, and the coefficients can further be optimized by data-driven optimization. Therefore,
starting from rule-based models, CMBP can directly incorporate prior knowledge or intuition into
DST, while at the same time the model is allowed to be data-driven.

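More concretely, assuming that both slot and value are independent, a CMBP model can be
defined as

$$b_{t+1}(v) = \mathcal{P}\big(P^+_{t+1}(v), P^-_{t+1}(v), \tilde{P}^+_{t+1}(v), \tilde{P}^-_{t+1}(v), b^r_t, b_t(v)\big) \quad \text{s.t. constraints}$$ (1)

where $b_t(v)$, $P^+_t(v)$, $P^-_t(v)$, $\tilde{P}^+_t(v)$, $\tilde{P}^-_t(v)$, $b^r_t$ are all probabilistic features which are defined as
below:

• $b_t(v)$: belief of “the value being $v$ at turn $t$”

• $P^+_t(v)$: sum of scores of SLU hypotheses informing or affirming value $v$ at turn $t$

• $P^-_t(v)$: sum of scores of SLU hypotheses denying or negating value $v$ at turn $t$

• $\tilde{P}^+_t(v) = \sum_{v' \notin \{v, \text{None}\}} P^+_t(v')$

• $\tilde{P}^-_t(v) = \sum_{v' \notin \{v, \text{None}\}} P^-_t(v')$

• $b^r_t$: belief of the value being ‘None’ (the value not mentioned) at turn $t$, i.e. $b^r_t = 1 - \sum_{v' \neq \text{None}} b_t(v')$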

and $\mathcal{P}(\cdot)$ is a multivariate polynomial function¹

$$\mathcal{P}(x_0, \ldots, x_D) = \sum_{0 \le k_1 \le \cdots \le k_n \le D} g_{k_1,\ldots,k_n} \prod_{1 \le i \le n} x_{k_i}$$ (2)

where $D + 1$ is the number of input variables, $n$ is the order of the polynomial, and $g_{k_1,\ldots,k_n}$ are
the parameters of CMBP. Order 3 gives a good trade-off between complexity and performance, hence
order 3 is used in our previous work (Sun et al., 2014b; Yu et al., 2015) and in this paper.

1. The notation $\sum_{0 \le k_1 \le \cdots \le k_n \le D}$ is shorthand for a series of $n$ nested sums over bounded ranges, i.e.
$\sum_{0 \le k_1 \le D} \sum_{k_1 \le k_2 \le D} \cdots \sum_{k_{n-1} \le k_n \le D}$.
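
Equation (2) can be evaluated directly by enumerating non-decreasing index tuples. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function name `cmbp_polynomial`, the dictionary layout of the coefficients, and the toy coefficient values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations_with_replacement
from math import prod


def cmbp_polynomial(x, g, order=3):
    """Evaluate equation (2).

    x     -- list of D + 1 input values x_0 .. x_D (hypothetical layout)
    g     -- dict mapping non-decreasing index tuples (k_1, .., k_n) to
             coefficients; tuples that are absent count as coefficient 0
    order -- polynomial order n
    """
    total = 0.0
    # combinations_with_replacement yields exactly the tuples with
    # 0 <= k_1 <= ... <= k_n <= D, i.e. the nested sums of footnote 1.
    for k in combinations_with_replacement(range(len(x)), order):
        coeff = g.get(k, 0)
        if coeff:
            total += coeff * prod(x[i] for i in k)
    return total


# Toy example (made-up numbers): D + 1 = 3 inputs, order 2,
# polynomial 2 * x0 * x0 - 1 * x1 * x2.
x = [1.0, 0.5, 0.4]
g = {(0, 0): 2, (1, 2): -1}
value = cmbp_polynomial(x, g, order=2)
```

Storing only the non-zero coefficients keeps the sketch close to the rule-based reading of CMBP, where most monomials are expected to be absent.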

The constraints in equation (1) encode all necessary probabilistic conditions (Yu et al., 2015).
For instance,

$$0 \le P^+_t(v) + \tilde{P}^+_t(v) \le 1$$ (3)

$$0 \le b_t(v) \le 1$$ (4)

The constraints in equation (1) also encode prior knowledge or intuition (Yu et al., 2015). For
example, the rule “goal belief should be unchanged or positively correlated with the positive scores
from SLU” can be represented by

$$\frac{\partial b_{t+1}(v)}{\partial P^+_{t+1}(v)} \ge 0$$ (5)

The definition of CMBP formulates a search space of rule-based models, in which it is easy to
employ a data-driven criterion to find a rule-based model with good performance. Considering that CMBP
is originally motivated from Bayesian probability operation, which leads to the natural use of integer
polynomial coefficients ($g \in \mathbb{Z}$), the data-driven optimization can be formulated as an integer
programming problem (Sun et al., 2014b; Yu et al., 2015). Additionally, CMBP can also be viewed
as a statistical approach. Hence, the polynomial coefficients can be extended to real numbers. The
optimization of the real coefficients can be done by first obtaining an integer-coefficient CMBP and then
performing hill climbing search (Yu et al., 2015).


3. Recurrent Polynomial Network 

Recurrent polynomial network, which is proposed in this paper, takes the other way to bridge rule- 
based and statistical approaches. The basic idea of RPN is to enable a kind of statistical model 
to take advantage of prior knowledge or intuition by using the parameters of rule-based models to 
initialize the parameters of statistical models. 

Computational networks have been researched for decades, from very basic architectures such
as the perceptron (Rosenblatt, 1958) to today’s various kinds of deep neural networks. Recurrent
computational networks, a class of computational networks which have recurrent connections, have also
been researched for a long time, from fully recurrent networks to networks with relatively complex
structures such as Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). Like
common neural networks, RPN is a statistical approach, so it is as easy to add features and try
complex structures in RPN as in neural networks. However, compared with common neural networks,
which are “black boxes”, an RPN can essentially be seen as a polynomial function. Hence,
considering that a CMBP is also a polynomial function, the prior knowledge and intuition encoded in
CMBP can be transferred to RPN by using the parameters of CMBP to initialize RPN. In this way,
it bridges rule-based models and statistical models.

A recurrent polynomial network is a computational network. The network contains multiple
edges and loops. Each node is either an input node, which is used to represent an input value, or a
computation node. Each node $x$ is set an initial value $u_x^{(0)}$ at time 0, and its value $u_x^{(t)}$ is updated at time
$1, 2, \cdots$. Both the type of edges and the type of nodes decide how the nodes’ values are updated.
There are two types of edges. One type, referred to as type-1, indicates that the value updating at time $t$
takes the value of a node at time $t - 1$, i.e. type-1 edges are recurrent edges, while the other type,
referred to as type-2, indicates that the value updating at time $t$ takes another node’s value at time $t$.
Except for loops made of type-1 edges, the network should not have loops. For simplicity, let $\hat{I}_x$
be the set of nodes $y$ which are linked to node $x$ by a type-1 edge, and $I_x$ be the set of nodes $y$ which
are linked to node $x$ by a type-2 edge. Based on these definitions, two types of computation nodes,
sum and product, are introduced. Specifically, at time $t > 0$, if node $x$ is a sum node, its value $u_x^{(t)}$
is updated by

$$u_x^{(t)} = \sum_{y \in \hat{I}_x} \hat{w}_{x,y} u_y^{(t-1)} + \sum_{y \in I_x} w_{x,y} u_y^{(t)}$$ (6)

where $\hat{w}_{x,y}, w_{x,y} \in \mathbb{R}$ are the weights of edges.

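Similarly, if node $x$ is a product node, its value is updated by

$$u_x^{(t)} = \prod_{y \in \hat{I}_x} \big(u_y^{(t-1)}\big)^{\hat{M}_{x,y}} \prod_{y \in I_x} \big(u_y^{(t)}\big)^{M_{x,y}}$$ (7)

where $\hat{M}_{x,y}$ and $M_{x,y}$ are integers, denoting the multiplicity of the type-1 edge $\overrightarrow{yx}$ and the
multiplicity of the type-2 edge $\overrightarrow{yx}$ respectively. It is noted that only the weights $\hat{w}$ and $w$ are parameters of RPN, while
$\hat{M}_{x,y}$ and $M_{x,y}$ are constant given the structure of an RPN.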
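
The two update rules in equations (6) and (7) can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names, the dictionary-based edge representation, the toy network (inputs `a` and `b`, product node `c`, sum node `d` with a recurrent self-connection) and its weights are assumptions made here, and the toy network is not claimed to be the one drawn in any figure of the paper.

```python
from math import prod


def update_sum(prev, cur, type1, type2):
    """Equation (6): type1/type2 map a source node to its edge weight.

    Type-1 edges read the source value at time t - 1 (prev),
    type-2 edges read the source value at time t (cur).
    """
    return (sum(w * prev[y] for y, w in type1.items())
            + sum(w * cur[y] for y, w in type2.items()))


def update_product(prev, cur, type1, type2):
    """Equation (7): type1/type2 map a source node to its integer multiplicity."""
    return (prod(prev[y] ** m for y, m in type1.items())
            * prod(cur[y] ** m for y, m in type2.items()))


def step(prev, inputs):
    """Advance the hypothetical toy network by one time step."""
    cur = dict(inputs)
    # Product node c = a^2 * b, using type-2 edges only.
    cur["c"] = update_product(prev, cur, type1={}, type2={"a": 2, "b": 1})
    # Sum node d = 1.0 * d(t-1) + 0.5 * c(t): one type-1 and one type-2 edge.
    cur["d"] = update_sum(prev, cur, type1={"d": 1.0}, type2={"c": 0.5})
    return cur


state = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}   # values at time 0
state = step(state, {"a": 2.0, "b": 3.0})           # c = 12.0, d = 6.0
first_d = state["d"]
state = step(state, {"a": 1.0, "b": 1.0})           # c = 1.0, d = 6.5
second_d = state["d"]
```

Nodes are updated in an order compatible with the type-2 edges (here `c` before `d`), which is possible because loops may only be closed through type-1 edges.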



Figure 2: A simple example of RPN. The type of node a, b, c, d is input, input, product, and sum
respectively. Edge da is of type-1, while the other edges are of type-2. $M_{a,c} = 2$,
$M_{b,c} = M_{c,d} = M_{d,d} = 1$.


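Let $\mathbf{u}_C^{(t)}$ and $\mathbf{u}_I^{(t)}$ denote the vector of computation nodes’ values and the vector of input nodes’
values at time $t$ respectively, then a well-defined RPN can be seen as a polynomial function as
below.

$$\mathbf{u}_C^{(t)} = \mathcal{P}\big(\mathbf{u}_C^{(t-1)}, \mathbf{u}_I^{(t-1)}, \mathbf{u}_I^{(t)}\big)$$ (8)

where $\mathcal{P}$ is defined by equation (2). For example, for the RPN in figure 2, its corresponding
polynomial function is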

= (9) 

Each computation node can be regarded as an output node. For example, for the RPN in figure
2, node c and node d can be set as output nodes.


4. RPN for Dialogue State Tracking


As introduced in section 1, in this paper the dialogue state tracker receives SLU N-best hypotheses
for each user turn, each hypothesis having a set of act-slot-value tuples with a confidence score.
The dialogue state tracker is supposed to output a set of distributions of the joint user goal, i.e.
the value for each slot. For simplicity and consistency with the work of Sun et al. (2014b) and
Yu et al. (2015), slot and value independence are assumed in the RPN model for dialogue state
tracking, though neither CMBP nor RPN is limited to these assumptions. In the rest of the paper,
$b_t(v)$, $P^+_t(v)$, $P^-_t(v)$, $\tilde{P}^+_t(v)$, $\tilde{P}^-_t(v)$ are abbreviated by $b_t$, $P^+_t$, $P^-_t$, $\tilde{P}^+_t$, $\tilde{P}^-_t$ respectively in
circumstances where there is no ambiguity.


4.1 Structure 

Before describing details of the structure used in practice, to help understand the
correspondence between RPN and CMBP, let us first look at a simplified case with a smaller
feature set and a smaller order: the correspondence between the RPN shown in
figure 3 and the 2-order polynomial (10) with features $b_{t-1}$, $P^+_t$, 1:

$$b_t = 1 - (1 - b_{t-1})(1 - P^+_t) = b_{t-1} + P^+_t - P^+_t b_{t-1}$$ (10)

Recall that a CMBP of polynomial order 2 with 3 features is the following equation (refer to
equation (2)):

$$\mathcal{P}(x_0, x_1, x_2) = \sum_{0 \le k_1 \le k_2 \le 2} g_{k_1,k_2} \prod_{1 \le i \le 2} x_{k_i}$$ (11)

The RPN in figure 3 has three layers. The first layer only contains input nodes. The second
layer only contains product nodes. The third layer only contains sum nodes. Every product node in
the second layer denotes a monomial of order 2 such as $(b_{t-1})^2$, $b_{t-1}P^+_t$ and so on. Every product
node in the second layer is linked to the sum node in the third layer, whose value is a weighted sum
of the values of the product nodes. With weights set according to the coefficients in equation (10), the value of the
sum node in the third layer is essentially the $b_t$ in equation (10).
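
The simplified case can be checked numerically. The sketch below is an illustration under assumptions rather than the authors' code: it orders the three features as ($b_{t-1}$, $P^+_t$, 1), builds one product node per order-2 monomial, and reads the sum-node weights off the expanded form of equation (10); the names `rpn_belief` and `rule_belief` are hypothetical.

```python
from itertools import combinations_with_replacement

# Feature indices (assumed ordering): 0 -> b_{t-1}, 1 -> P_t^+, 2 -> constant 1.
PAIRS = list(combinations_with_replacement(range(3), 2))  # second-layer nodes

# Sum-node weights taken from b_{t-1} + P_t^+ - P_t^+ * b_{t-1};
# every monomial not listed has weight 0.
WEIGHTS = {
    (0, 2): 1.0,    # b_{t-1} * 1
    (1, 2): 1.0,    # P_t^+ * 1
    (0, 1): -1.0,   # b_{t-1} * P_t^+
}


def rpn_belief(b_prev, p_plus):
    """Three-layer evaluation: inputs -> order-2 products -> weighted sum."""
    x = [b_prev, p_plus, 1.0]
    products = {k: x[k[0]] * x[k[1]] for k in PAIRS}
    return sum(WEIGHTS.get(k, 0.0) * value for k, value in products.items())


def rule_belief(b_prev, p_plus):
    """Left-hand form of equation (10)."""
    return 1.0 - (1.0 - b_prev) * (1.0 - p_plus)


grid = [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]
max_gap = max(abs(rpn_belief(b, p) - rule_belief(b, p)) for b in grid for p in grid)
```

The constant feature is what lets an order-2 product node represent the lower-order terms $b_{t-1}$ and $P^+_t$.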

Like the above simplified case, a layered RPN structure shown in figure 4 is used for dialogue
state tracking in our first trial, which essentially corresponds to 3-order CMBP, though the RPN
framework is not limited to the layered topology. Recall that a CMBP of polynomial order 3 is used
as shown in the following equation (refer to equation (2)):

$$\mathcal{P}(x_0, \ldots, x_D) = \sum_{0 \le k_1 \le k_2 \le k_3 \le D} g_{k_1,k_2,k_3} \prod_{1 \le i \le 3} x_{k_i}$$ (12)

2. For DSTC-2/3 tasks, one slot can have at most one value, i.e. $0 \le \sum_v b_t(v) \le 1$. Since value independence
is assumed, to strictly maintain that relation, the belief is rescaled so that the sum of the belief of each value
plus the belief of ‘None’ is 1 when the belief is being output. Actually, to enable RPN to strictly maintain that
relation, our original design of RPN had a “normalization” step when passing the belief from turn $t$ to turn $t + 1$. The
“normalization” step rescales the belief to make the sum of the belief of each value plus the belief of ‘None’
be 1. Our later experiment, however, demonstrated that there was no significant performance difference between the
RPN with and without the “normalization” step. Therefore, in practice, for simplicity, the “normalization” step can
be omitted, value independence can be assumed, and the only thing needed is to rescale the belief when it is being
output.

Figure 3: A simple example of RPN for DST.



Figure 4: RPN for DST.

Let $(l, i)$ denote the index of the $i$-th node in the $l$-th layer. The detailed definitions of each layer
are as follows:


• First layer / Input layer:

Input nodes are features at turn $t$, which correspond to the variables of CMBP in section 2, i.e.

- $u^{(t)}_{(1,0)} = b_{t-1}$

- $u^{(t)}_{(1,1)} = P^+_t$

- $u^{(t)}_{(1,2)} = P^-_t$

- $u^{(t)}_{(1,3)} = \tilde{P}^+_t$

- $u^{(t)}_{(1,4)} = \tilde{P}^-_t$

- $u^{(t)}_{(1,5)} = 1$

While 7 features are used in previous work on CMBP (Sun et al., 2014b; Yu et al., 2015), only
6 of them are used in RPN, with feature $b^r_{t-1}$ removed³. Since our experiments showed that the
performance of CMBP would not become worse without feature $b^r_{t-1}$, to make the structure
more compact, $b^r_{t-1}$ is not used in this paper for RPN. Accordingly, the CMBP mentioned
in the rest of the paper does not use this feature either.


• Second layer:

The value of every product node in the second layer is a monomial, as in the simplified case.
Every product node has indegree 3, which corresponds to the order of CMBP.

Every monomial in CMBP is the product of three repeatable features. Correspondingly, the
value of every product node in the second layer is the product of the values of three repeatable nodes
in the first layer. Every triple $(k_1, k_2, k_3)$ $(0 \le k_1 \le k_2 \le k_3 \le 5)$ is enumerated to create a
product node $x = (2, i)$ in the second layer to which nodes $(1, k_1)$, $(1, k_2)$, $(1, k_3)$ are linked, i.e.
$I_x = \{(1, k_1), (1, k_2), (1, k_3)\}$. And thus

$$u_x^{(t)} = u^{(t)}_{(1,k_1)} u^{(t)}_{(1,k_2)} u^{(t)}_{(1,k_3)}$$

Each node in the second layer is created by a distinct triple. So given the 6 input
features, there are $\sum_{k_1=0}^{5} \sum_{k_2=k_1}^{5} \sum_{k_3=k_2}^{5} 1 = \binom{8}{3} = 56$ nodes in the second layer.

To simplify the notation, a bijection from nodes to monomials is defined as:

$$T : \{x \mid x \text{ is the index of a node in the 2nd layer}\} \to \{(k_1, k_2, k_3) \mid 0 \le k_1 \le k_2 \le k_3 \le D\}$$ (13)

$$T(x) = (k_1, k_2, k_3)$$

where $D + 1 = 6$ is the number of nodes in the first layer, i.e. the input feature dimension.


• Third layer:

The value of sum node $x = (3, 0)$ in the third layer corresponds to the output value of
CMBP. Every product node in the second layer is linked to it. Node $x$’s value $u_x^{(t)}$ is a weighted
sum of the values of the product nodes, where the weights correspond to $g_{k_1,k_2,k_3}$ in equation (12).

3. $b^r_{t-1}$ is the belief of the value being ‘None’, whose precise definition is given in section 2.


With only sum and product operations involved, every node’s value is essentially a polynomial
of input features. And just as in a recurrent neural network, a node at time $t$ can be linked to a node at
time $t + 1$. That is why this model is called recurrent polynomial network.
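
The layered structure and its recurrence can be sketched as follows. This is a hedged illustration, not the authors' implementation: the enumeration order of the triples (the paper does not specify how the index $i$ of node $(2, i)$ is assigned), the initial belief of 0, the function names, and the example weights (equation (10) padded with the constant feature to reach order 3) are all assumptions, and no activation function is applied here.

```python
from itertools import combinations_with_replacement

NUM_INPUTS = 6  # b_{t-1}, P+, P-, tilde P+, tilde P-, constant 1 (indices 0..5)

# One second-layer product node per non-decreasing triple; position i in this
# list plays the role of node (2, i), so TRIPLES[i] acts as the bijection T.
TRIPLES = list(combinations_with_replacement(range(NUM_INPUTS), 3))


def forward_turn(inputs, weights):
    """Second layer: triple products. Third layer: weighted sum, node (3, 0)."""
    second = [inputs[k1] * inputs[k2] * inputs[k3] for k1, k2, k3 in TRIPLES]
    return sum(w * u for w, u in zip(weights, second))


def track(turns, weights, b0=0.0):
    """Unroll over turns; the output belief is fed back as input (1, 0)."""
    beliefs, b = [], b0
    for p_plus, p_minus, tp_plus, tp_minus in turns:
        b = forward_turn([b, p_plus, p_minus, tp_plus, tp_minus, 1.0], weights)
        beliefs.append(b)
    return beliefs


# Illustrative coefficients only: b + P+ - b * P+ written with order-3 triples.
coeffs = {(0, 5, 5): 1.0, (1, 5, 5): 1.0, (0, 1, 5): -1.0}
weights = [coeffs.get(k, 0.0) for k in TRIPLES]
beliefs = track([(0.6, 0.0, 0.0, 0.0), (0.5, 0.0, 0.0, 0.0)], weights)
```

With these made-up weights the belief rises from 0.6 after the first turn to about 0.8 after the second, matching repeated application of equation (10).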


The parameters of the RPN can be set according to the CMBP coefficients $g_{k_1,k_2,k_3}$ in equation (12)
so that the output value is the same as the value of CMBP, which is a direct way of applying prior
knowledge and intuition to statistical models. It is explained in detail in section 4.4.


4.2 Activation Function 

In DST, the output value is a belief which should lie in [0,1], while the values of computational nodes
are not bounded to any interval in RPN. Experiments showed that if weights are not properly set
in RPN and a belief $b_{t-1}$ output by RPN is larger than 1, then $b_t$ may grow much larger because $b_t$
is the weighted sum of monomials such as $(b_{t-1})^2$. The belief of later turns such as $b_{t+10}$ will tend to
infinity.

Therefore, an activation function is needed to map $b_t$ to a legal belief value (referred to as $b'_t$)
in (0,1). Three kinds of functions, the logistic function, the clip function, and the softclip function, have
been considered. A logistic function is defined as

$$\text{logistic}(x) = \frac{L}{1 + e^{-\eta(x - x_0)}}$$ (15)

It can map $\mathbb{R}$ to (0,1) by setting $L = 1$. However, since the RPN designed for dialogue
state tracking basically performs operations similar to CMBP, which is motivated from Bayesian probability
operation (Yu et al., 2015), intuitively we expect the activation function to be linear on [0,1] so that little
distortion is added to the belief.

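As an alternative, a clip function is defined as

$$\text{clip}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$$ (16)

It is linear on [0,1]. However, if $b'_t = \text{clip}(b_t)$, $b_t \notin [0,1]$ and $\mathcal{L}$ is the loss function, then

$$\frac{\partial \mathcal{L}}{\partial b_t} = \frac{\partial \mathcal{L}}{\partial b'_t} \frac{\partial b'_t}{\partial b_t}$$ (17)

Since $\frac{\partial b'_t}{\partial b_t} = 0$ outside [0,1], $\frac{\partial \mathcal{L}}{\partial b_t}$ would be 0 whatever $\frac{\partial \mathcal{L}}{\partial b'_t}$ is. This gradient vanishing phenomenon
may affect the effectiveness of backpropagation training in section 4.5.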

So an activation function softclip(·) is introduced, which is a combination of the logistic function
and the clip function. Let $\epsilon$ denote a small value such as 0.01, and $\delta$ denote the offset of the sigmoid function
such that $\text{sigmoid}(\epsilon - 0.5 + \delta) = \epsilon$. Here the sigmoid function refers to the special case of the
logistic function defined by the formula

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$ (18)

Figure 5: Comparison among clip function, logistic function, and softclip function


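The softclip function is defined as

$$\text{softclip}(x) = \begin{cases} \text{sigmoid}(x - 0.5 + \delta) & \text{if } x \le \epsilon \\ x & \text{if } \epsilon < x < 1 - \epsilon \\ \text{sigmoid}(x - 0.5 - \delta) & \text{if } x \ge 1 - \epsilon \end{cases}$$ (19)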


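$\text{softclip} : \mathbb{R} \to (0,1)$ is a non-decreasing, continuous function. However, it is not differentiable
when $x = \epsilon$ or $x = 1 - \epsilon$. So we define its derivative as follows:

$$\frac{\partial\, \text{softclip}(x)}{\partial x} \triangleq \begin{cases} \frac{\partial\, \text{sigmoid}(x - 0.5 + \delta)}{\partial x} & \text{if } x \le \epsilon \\ 1 & \text{if } \epsilon < x < 1 - \epsilon \\ \frac{\partial\, \text{sigmoid}(x - 0.5 - \delta)}{\partial x} & \text{if } x \ge 1 - \epsilon \end{cases}$$ (20)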


It is like a clip function. However, its derivative may be small on some inputs but is not zero.
Figure 5 shows the comparison among the clip function, logistic function, and softclip function. In
practice, the softclip function has demonstrated better performance than both the clip and logistic functions⁴
and was used in the rest of the paper.
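
A minimal Python sketch of softclip and its derivative follows. It is an illustration under assumptions, not the authors' code: the offset is solved in closed form from the condition $\text{sigmoid}(\epsilon - 0.5 + \delta) = \epsilon$, the boundary points are assigned to the sigmoid branches (the strictness of the inequalities at $x = \epsilon$ and $x = 1 - \epsilon$ is an assumption), and the function names are hypothetical.

```python
import math

EPS = 0.01
# Solve sigmoid(EPS - 0.5 + DELTA) = EPS for DELTA using the logit of EPS.
DELTA = math.log(EPS / (1.0 - EPS)) - EPS + 0.5


def sigmoid(x):
    """Numerically stable sigmoid, equation (18)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softclip(x):
    """Equation (19): identity in the middle, sigmoid tails outside."""
    if x <= EPS:
        return sigmoid(x - 0.5 + DELTA)
    if x >= 1.0 - EPS:
        return sigmoid(x - 0.5 - DELTA)
    return x


def softclip_grad(x):
    """Equation (20): derivative used for backpropagation."""
    if x <= EPS:
        s = sigmoid(x - 0.5 + DELTA)
        return s * (1.0 - s)
    if x >= 1.0 - EPS:
        s = sigmoid(x - 0.5 - DELTA)
        return s * (1.0 - s)
    return 1.0


low = softclip(-5.0)        # small but strictly positive
high = softclip(5.0)        # close to, but strictly below, 1
grad_low = softclip_grad(-5.0)
```

Unlike clip, the gradient in the tails is small but non-zero, so an out-of-range belief can still be pulled back towards [0,1] during training.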

With the activation function, a new type of computation node, referred to as activation node, is
introduced. An activation node only takes one input and only has one input edge of type-2, i.e. $|I_x| = 1$
and $\hat{I}_x = \emptyset$. The value of an activation node $x$ is calculated as

$$u_x^{(t)} = \text{softclip}\big(u_{j_x}^{(t)}\big)$$ (21)


where $j_x$ denotes the input node of node $x$, i.e. $I_x = \{j_x\}$.

The activation function is used in the rest of the paper. Figure 6 gives an example of RPN with
activation functions, whose structure is constructed by adding activation functions to the RPN in
figure 4.

4. An experiment was done on the DSTC-2 dataset where RPNs using different activation functions were trained on
dstc2trn and tested on dstc2dev. The accuracy and L2 of RPNs with the clip, logistic, and softclip functions are
(0.779, 0.329), (0.789, 0.352), (0.790, 0.317) respectively. In particular, logistic functions with several different
parameter values were evaluated and the result reported here is the best one.



Figure 6: RPN for DST with activation functions 


4.3 Further Exploration on Structure 


Adding features to CMBP is not easy because additional prior knowledge is needed to keep
the search space from becoming too large. Concretely, adding features can introduce new monomials. Since the
trivial search space grows exponentially with the number of monomials, the search space tends
to be too large to explore when new features are added. Hence, to reduce the search space, additional
prior knowledge is needed, which can introduce new constraints on the polynomial coefficients. For
the same reason, increasing the model complexity also requires additional suitable prior knowledge
to keep the search space from becoming too large in CMBP.

In contrast, since RPN can be seen as a statistical model, it is as easy as in most statistical
approaches such as RNN to add new features to RPN and use more complex structures. At the same
time, no matter what new features are used and how complex the structure is, RPN can always take
advantage of prior knowledge and intuition, which is discussed in section 4.4. In this paper, both new
features and complex structure are explored.

Adding new features can be done by just adding input nodes which correspond to the new
features, and then adding product nodes corresponding to the new possible monomials introduced by
the new features. In this paper, for slot $s$, value $v$ at turn $t$, in addition to $f_0 \sim f_5$ which are defined
as $b_{t-1}(v)$, $P^+_t(v)$, $P^-_t(v)$, $\tilde{P}^+_t(v)$, $\tilde{P}^-_t(v)$, and 1 respectively, 4 new features are investigated. $f_6$
and $f_7$ are features of system acts at the last turn: for slot $s$, value $v$ at turn $t$,

• $f_6$ = canthelp(s, t, v) ∪ canthelp.missing_slot_value(s, t) = 1 if the system cannot offer
a venue with the constraint $s = v$ or the value of slot $s$ is not known for the selected venue,
otherwise 0.

• $f_7$ = select(s, t, v) = 1 if the system asks the user to pick a suggested value for slot $s$,
otherwise 0.

$f_6$ and $f_7$ are introduced because the user is likely to change their goal if given the machine acts canthelp(s, t, v),
canthelp.missing_slot_value(s, t) and select(s, t, v). $f_8$ and $f_9$ are features of user acts at the
current turn: for slot $s$, value $v$ at turn $t$,

• $f_8$ = inform(s, t, v) = 1 if one of the SLU hypotheses from the user informs that slot $s$ is $v$,
otherwise 0.

• $f_9$ = deny(s, t, v) = 1 if one of the SLU hypotheses from the user denies that slot $s$ is $v$,
otherwise 0.

$f_8$ and $f_9$ are features about the SLU act type, introduced to make the system robust when the confidence
scores of SLU hypotheses are not reliable.

In this paper, the complexity of evaluating and training RPN for DST does not increase sharply
because a constant order 3 is used: the number of product nodes in the second layer grows from 56
to 220 when the number of features grows from 6 to 10.
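
The two node counts follow from counting multisets: the number of order-3 monomials over $m$ repeatable features is $\binom{m+2}{3}$. A short worked check, using only this standard combinatorial identity:

```latex
\binom{6 + 3 - 1}{3} = \binom{8}{3} = \frac{8 \cdot 7 \cdot 6}{3!} = 56,
\qquad
\binom{10 + 3 - 1}{3} = \binom{12}{3} = \frac{12 \cdot 11 \cdot 10}{3!} = 220
```

The growth is therefore cubic in the number of features for a fixed order of 3, rather than exponential.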

In addition to new features, RPN of more complex structure is also investigated in this paper.
To capture properties of the dialogue process in the way the belief $b_t$ does, a new sum node $x = (3,1)$ in
the third layer is introduced. The connection of (3,1) is the same as that of (3,0), so it introduces a new
recurrent connection. The exact meaning of its value is unknown. However, it is the only value
used to record information other than $b_t$ of previous turns. Every other input feature except $b_t$ is
a feature of the current turn $t$. Compared with $b_t$, there are fewer restrictions on the value of (3,1) since
its value is not directly supervised by the label. Hence, introducing (3,1) may help to reduce the
effect of inaccurate labels.

The structure of the RPN with 4 new features and 1 new sum node, together with the new activation
nodes introduced in section 4.2, is shown in figure 7.





Figure 7: RPN with new features and more complex structure for DST 


4.4 RPN Initialization 

Like most neural network models such as RNN, the initialization of RPN can be done by setting
each weight, i.e. $\hat{w}$ and $w$, to be a small random value. However, with its unique structure, the
initialization can be much better by taking advantage of the relationship between CMBP and RPN
which is introduced in section 4.1.

When RPN is initialized according to a CMBP, prior knowledge and constraints are used to set
RPN’s initial parameters as a suboptimal point in the whole parameter space. RPN as a statistical
model can fully utilize the advantages of statistical approaches. Moreover, RPN is better than
real-coefficient CMBP even though both use data samples to train parameters. In the work of Yu et al. (2015),
real-coefficient CMBP uses hill climbing to adjust parameters that are initially not zero, and the change
of a parameter is always a multiple of 0.1. RPN can adjust all parameters concurrently, including parameters
initialized as 0, while the complexity of adjusting all parameters concurrently is nearly
the same as that of adjusting one parameter in CMBP. Besides, the change of parameters can be large or
small, depending on the learning rate. Thus, RPN and CMBP both bridge rule-based models and
statistical ones: RPN is a statistical model utilizing rule advantages, and CMBP is a rule model
utilizing statistical advantages.

In fact, given a CMBP, an RPN can achieve the same performance as the CMBP just by setting
its weights according to the coefficients of the CMBP. To illustrate this, the steps of initializing
the RPN in figure 7 with a CMBP of features $f_0 \sim f_5$ are described below.

First, to ensure that the newly added sum node $x = (3,1)$ will not influence the output $b_t$ in
the RPN with initial parameters, $w_{x,y}$ is set to 0 for all $y$, so node $x$'s value $u_x^{(t)}$
is always 0.

Next, considering that the RPN in figure 7 has more features than the CMBP does, the weights related
to the new features should be set to 0. Specifically, suppose node $x$ is the sum node in the third
layer of the RPN denoting $b_t$ before activation, and node $y$ is one of the product nodes in the
second layer denoting a monomial. If product node $y$ is a product involving features
$f_6, f_7, f_8, f_9$ or the added sum node, then node $y$'s value is not a monomial in the CMBP, and
the weight $w_{x,y}$ should be set to 0.

Finally, if product node $y$ is a product of features $f_0 \sim f_5$, suppose the order of the CMBP
is 3; then $T(y) = (k_1, k_2, k_3)$ defined in equation (13) should satisfy
$0 \le k_1 \le k_2 \le k_3 \le 5$. The weight $w_{x,y}$ should be initialized as $g_{k_1,k_2,k_3}$,
which is the coefficient of $f_{k_1} f_{k_2} f_{k_3}$ in the CMBP. Thus,

$$w_{x,y} = \begin{cases} g_{k_1,k_2,k_3} & \text{if } x = (3,0) \text{ and } T(y) = (k_1, k_2, k_3) \\ 0 & \text{otherwise} \end{cases} \quad (22)$$


For RPNs of other structures, the initialization can be done by following similar steps.
Experiments show that after training, only a few weights are larger than 0.1, regardless of
whether CMBP or random initialization is used.
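
The three initialization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `init_from_cmbp`, the dictionary representation of weights, and the input indexing (0 to 5 for $f_0 \sim f_5$, 6 to 9 for $f_6 \sim f_9$, 10 for the added sum node) are hypothetical and not taken from the paper.

```python
from itertools import combinations_with_replacement


def init_from_cmbp(cmbp_coeffs, n_inputs=11, n_cmbp_features=6, order=3):
    """Build initial weights for the two third-layer sum nodes.

    cmbp_coeffs maps a sorted index tuple (k1, k2, k3) to the CMBP
    coefficient g_{k1,k2,k3}. The indexing of inputs is an assumption
    made for this sketch only.
    """
    main, aux = {}, {}
    for ks in combinations_with_replacement(range(n_inputs), order):
        # A product node is a CMBP monomial only if it uses f_0..f_5 alone.
        in_cmbp = all(k < n_cmbp_features for k in ks)
        main[ks] = cmbp_coeffs.get(ks, 0.0) if in_cmbp else 0.0
        # The added sum node (3,1) starts with all-zero weights.
        aux[ks] = 0.0
    return main, aux


# Hypothetical CMBP with two non-zero coefficients.
main, aux = init_from_cmbp({(0, 1, 2): 1.0, (1, 1, 5): -0.5})
print(sum(1 for w in main.values() if w != 0.0))
```

With such an initialization the untrained RPN computes exactly the CMBP polynomial, because every product node outside the CMBP contributes with weight 0.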


4.5 Training RPN 

Suppose $T(d)$ is the number of turns in dialogue $d$, $H(d,s,t)$ is the set of values corresponding
to slot $s$ appearing in the SLU hypotheses in turn $t$ of dialogue $d$, $b_{d,s,t}(v)$ is the output
belief of value $v$ in dialogue $d$, and $l_{d,s,t}(v)$ is the indicator of goal $s = v$ being part of
the joint goal at turn $t$ in the label of dialogue $d$. The cost function is defined as

$$C = \sum_{d} \sum_{t=0}^{T(d)-1} \sum_{s} \sum_{v \in \bigcup_{i=0}^{t} H(d,s,i)} \left( b_{d,s,t}(v) - l_{d,s,t}(v) \right)^2 \quad (23)$$
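
As a concrete reading of the cost function, the following Python sketch sums squared differences between output beliefs and 0/1 labels. The nested-dictionary layout of `beliefs` and `labels` and the function name `rpn_cost` are assumptions made for illustration; each belief dictionary is assumed to contain only the values seen in the SLU hypotheses up to that turn.

```python
def rpn_cost(beliefs, labels):
    """Sum of squared errors over dialogues, turns, slots and values.

    beliefs[d][t][s] maps value v -> b_{d,s,t}(v).
    labels[d][t] maps slot s -> labelled value at turn t (missing if none).
    Both layouts are hypothetical, chosen for this sketch.
    """
    total = 0.0
    for d, turns in beliefs.items():
        for t, slots in enumerate(turns):
            for s, dist in slots.items():
                for v, b in dist.items():
                    # Indicator l_{d,s,t}(v): 1 if s = v is in the labelled goal.
                    l = 1.0 if labels[d][t].get(s) == v else 0.0
                    total += (b - l) ** 2
    return total


beliefs = {"d1": [{"food": {"thai": 0.8}},
                  {"food": {"thai": 0.6, "chinese": 0.3}}]}
labels = {"d1": [{"food": "thai"}, {"food": "chinese"}]}
print(rpn_cost(beliefs, labels))
```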

The training process for each mini-batch can be divided into two parts: a forward pass and a
backward pass.




foreach mini-batch m do
    Initialize $\Delta w_{x,y} = 0$, $\Delta \hat{w}_{x,y} = 0$ for every x, y
    Initialize the value of each recurrent node at turn 0 as 0
    foreach training dialogue d, slot s, value v in mini-batch m do
        T ← the number of turns of the current training sample
        /* forward pass */
        for t ← 1 to T do
            for d ← 1 to 4 do
                foreach node x in time t, layer d do
                    evaluate $u_x^{(t)}$
                    if x is an output node then
                        $\delta_x^{(t)} \leftarrow 2 (u_x^{(t)} - l_t)$
                    else
                        $\delta_x^{(t)} \leftarrow 0$
        /* backward pass */
        for t ← T to 1 do
            for d ← 4 to 1 do
                foreach node x in time t, layer d do
                    foreach node $y \in I_x$ do
                        /* Node y is linked to node x by a "type-1" edge */
                        $\delta_y^{(t)} \leftarrow \delta_y^{(t)} + \delta_x^{(t)} \, \partial u_x^{(t)} / \partial u_y^{(t)}$
                    foreach node $y \in \hat{I}_x$ do
                        /* Node y is linked to node x by a "type-2" edge */
                        $\delta_y^{(t-1)} \leftarrow \delta_y^{(t-1)} + \delta_x^{(t)} \, \partial u_x^{(t)} / \partial u_y^{(t-1)}$
        for t ← 1 to T do
            for d ← 1 to 4 do
                foreach sum node x in time t, layer d do
                    foreach node $y \in I_x$ do
                        $\Delta w_{x,y} \leftarrow \Delta w_{x,y} + \alpha \, \delta_x^{(t)} u_y^{(t)}$
                    foreach node $y \in \hat{I}_x$ do
                        $\Delta \hat{w}_{x,y} \leftarrow \Delta \hat{w}_{x,y} + \alpha \, \delta_x^{(t)} u_y^{(t-1)}$
    foreach edge (x, y) do
        $w_{x,y} \leftarrow w_{x,y} - \Delta w_{x,y}$
        $\hat{w}_{x,y} \leftarrow \hat{w}_{x,y} - \Delta \hat{w}_{x,y}$

Algorithm 1: Training Algorithm of RPN for DST

Forward Pass For each training sample, every node's value at every time is evaluated first. When
evaluating $u_x^{(t)}$, the values of the nodes in $I_x$ and $\hat{I}_x$ should be evaluated
beforehand. The computation formula depends on the type of node $x$. In particular, for a layered
RPN structure, we can simply evaluate $u_{x_1}^{(t_1)}$ earlier than $u_{x_2}^{(t_2)}$ if
$t_1 < t_2$, or if $t_1 = t_2$ and $x_1$'s layer number is smaller than $x_2$'s.

Backward Pass Backpropagation through time (BPTT) is used in training RPN. Let the error of node
$x$ at time $t$ be $\delta_x^{(t)} = \frac{\partial C}{\partial u_x^{(t)}}$. If a node $x$ is an
output node, then $\delta_x^{(t)}$ should be initialized according to its label $l_t$ and output
value $u_x^{(t)}$; otherwise $\delta_x^{(t)}$ should be initialized to 0. After a node's error
$\delta_x^{(t)}$ is determined, it can be passed to $\delta_y^{(t)}$ ($y \in I_x$) and
$\delta_y^{(t-1)}$ ($y \in \hat{I}_x$). Error passing should follow the reversed direction of the
edges, so the order in which nodes pass their errors can follow the reverse of the order in which
node values are evaluated.

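When every $\delta_x^{(t)}$ has been evaluated, the increment on weight $w_{x,y}$ can be calculated by

$$\Delta w_{x,y} = \alpha \frac{\partial C}{\partial w_{x,y}} = \alpha \sum_{t=1}^{T} \frac{\partial C}{\partial u_x^{(t)}} \frac{\partial u_x^{(t)}}{\partial w_{x,y}} = \alpha \sum_{t=1}^{T} \delta_x^{(t)} u_y^{(t)} \quad (24)$$

where $\alpha$ is the learning rate. $\Delta \hat{w}_{x,y}$ can be evaluated similarly.

Note that only $w_{x,y}$ and $\hat{w}_{x,y}$ are parameters of RPN.

The complete formulae for evaluating the node value $u_x^{(t)}$ and passing the error
$\delta_x^{(t)}$ can be found in the appendix.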
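
To make the forward and backward passes concrete, here is a minimal Python sketch of BPTT on a toy recurrent polynomial. It is not the RPN of this paper: it has a single sum node whose value is fed back as input 0 of the next turn, the activation is taken to be the identity (standing in for softclip), and the monomial list `MONOMIALS` and all function names are hypothetical. The gradient it returns can be checked against finite differences of the cost.

```python
import math

# Each product node is a tuple of input indices. Index 0 is the previous
# belief b_{t-1} (reached through a type-2 edge); indices >= 1 are
# current-turn features. This toy structure is an assumption of the sketch.
MONOMIALS = [(0,), (1,), (0, 1), (1, 2)]


def forward(w, feats):
    """Evaluate every node for every turn; return (inputs, products, belief)."""
    b_prev, trace = 0.0, []          # recurrent node is 0 at turn 0
    for f in feats:
        inputs = [b_prev] + list(f)
        prods = [math.prod(inputs[k] for k in mono) for mono in MONOMIALS]
        b = sum(wj * pj for wj, pj in zip(w, prods))   # sum node, identity activation
        trace.append((inputs, prods, b))
        b_prev = b
    return trace


def cost(w, feats, labels):
    """Squared error between the belief and the label, summed over turns."""
    return sum((b - l) ** 2 for (_, _, b), l in zip(forward(w, feats), labels))


def bptt(w, feats, labels):
    """Return dC/dw computed by backpropagation through time."""
    trace = forward(w, feats)
    grad = [0.0] * len(w)
    delta_next = 0.0                 # error arriving from turn t+1
    for t in reversed(range(len(feats))):
        inputs, prods, b = trace[t]
        delta = 2.0 * (b - labels[t]) + delta_next     # error of the sum node
        for j, pj in enumerate(prods):
            grad[j] += delta * pj                      # delta_x^(t) * u_y^(t)
        # Pass the error back to b_{t-1} through each product node using input 0.
        delta_next = 0.0
        for wj, mono in zip(w, MONOMIALS):
            for pos, k in enumerate(mono):
                if k == 0:
                    rest = math.prod(inputs[i] for q, i in enumerate(mono) if q != pos)
                    delta_next += delta * wj * rest
    return grad


w = [0.3, 0.5, -0.2, 0.1]
feats = [(0.9, 0.1), (0.4, 0.3), (0.7, 0.2)]
labels = [1.0, 1.0, 0.0]
print(bptt(w, feats, labels))
```

Multiplying this gradient by the learning rate gives the increment of equation (24) for one training sample.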

In this paper, mini-batch training is used for RPN for DST, with batch size 8. In each training
epoch, $\Delta w_{x,y}$ and $\Delta \hat{w}_{x,y}$ are calculated for every training sample and added
together. The weights $w_{x,y}$ and $\hat{w}_{x,y}$ are updated by

$$w_{x,y} = w_{x,y} - \Delta w_{x,y} \quad (25)$$

$$\hat{w}_{x,y} = \hat{w}_{x,y} - \Delta \hat{w}_{x,y} \quad (26)$$

The pseudocode of training is shown in algorithm 1.


5. Experiment 

As introduced in section 1, the DSTC-2 and DSTC-3 tasks are used in this paper to evaluate the
proposed approach. Both tasks provide training dialogues with turn-level ASR hypotheses, SLU
hypotheses and user goal labels. The DSTC-2 task provides 2118 training dialogues in the restaurant
domain (Henderson et al., 2014b), while in DSTC-3 only 10 in-domain training dialogues in the
tourist domain are provided, because the DSTC-3 task is to adapt the tracker trained on DSTC-2 data
to the new domain with very few dialogues (Henderson et al., 2014a). Table 1 summarizes the size of
the datasets of DSTC-2 and DSTC-3.

| Task   | Dataset   | #Dialogues | Usage    |
|--------|-----------|------------|----------|
| DSTC-2 | dstc2trn  | 1612       | Training |
|        | dstc2dev  | 506        | Training |
|        | dstc2eval | 1117       | Test     |
| DSTC-3 | dstc3seed | 10         | Not used |
|        | dstc3eval | 2265       | Test     |

Table 1: Summary of data corpora of DSTC-2/3

The DST evaluation criteria are the joint goal accuracy and the L2 (Henderson et al., 2014b,a).
Accuracy is defined as the fraction of turns in which the tracker's 1-best joint goal hypothesis is
correct; the larger the better. L2 is the L2 norm between the distribution of all hypotheses output
by the tracker and the correct goal distribution (a delta function); the smaller the better. Besides,
schedule 2 and labelling scheme A defined in (Henderson et al., 2013) are used in both tasks.
Specifically, schedule 2 only counts the turns where new information about some slots is observed,
either in a system confirmation action or in the SLU list. Under labelling scheme A, the labelled
state is accumulated forwards through the whole dialogue. For example, the goal for slot $s$ is
"None" until it is informed as $s = v$ by the user; from then on, it is labelled as $v$ until it is
informed otherwise.
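
The labelling scheme and the two metrics can be illustrated with a short Python sketch. The function names and the dictionary representation of a hypothesis distribution are assumptions made here. The `l2_score` function follows the wording above (a Euclidean distance to a delta function); the official challenge scoring script may normalise differently, so the numbers it produces should not be compared with the tables in this paper.

```python
import math


def accumulate_goal_labels(informs):
    """Labelling scheme A for one slot: carry the last informed value forward.

    informs[t] is the value informed by the user at turn t, or None.
    """
    labels, current = [], None
    for v in informs:
        if v is not None:
            current = v
        labels.append(current)
    return labels


def turn_is_correct(dist, gold):
    """True if the 1-best hypothesis of the distribution equals the gold goal."""
    return max(dist, key=dist.get) == gold


def l2_score(dist, gold):
    """Distance between the tracker distribution and a delta function on gold."""
    keys = set(dist) | {gold}
    return math.sqrt(sum((dist.get(k, 0.0) - (1.0 if k == gold else 0.0)) ** 2
                         for k in keys))


print(accumulate_goal_labels([None, "thai", None, "chinese"]))
dist = {"thai": 0.6, "chinese": 0.3, None: 0.1}
print(turn_is_correct(dist, "thai"), l2_score(dist, "thai"))
```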


It has been shown that the organiser-provided live SLU confidence was not good enough (Zhu et al.,
2014; Sun et al., 2014a). Hence, most of the state-of-the-art results from DSTC-2 and DSTC-3 used
refined SLU, either explicitly rebuilding an SLU component or taking the ASR hypotheses into the
trackers (Williams, 2014; Sun et al., 2014a; Henderson et al., 2014d,c; Kadlec et al., 2014; Sun
et al., 2014b). In accordance with this, except for the results directly taken from other papers
(shown in table 5 and table 6), all experiments in this paper used the output from a refined semantic
parser (Zhu et al., 2014; Sun et al., 2014a) instead of the live SLU provided by the organizer.


For all experiments, MSE (see footnote 5) is used as the training criterion and full-batch training
is used. For both the DSTC-2 and DSTC-3 tasks, dstc2trn and dstc2dev are used: 60% of the data is
used for training and 40% for validation, unless otherwise stated. Validation is performed every 5
epochs. The learning rate is set to 0.6 initially. During training, the learning rate is halved each
time the performance does not increase. Training is stopped when the learning rate is sufficiently
small, or when the maximum number of training epochs is reached. Here, the maximum number of training
epochs is set to 40. L2 regularization is used for all the experiments (see footnote 6).
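
The training schedule described above can be sketched as a small Python loop. The callables `train_epoch` and `validate`, the threshold `min_lr` (the text only says "sufficiently small"), and the choice to compare each validation score against the best score so far are all assumptions of this sketch.

```python
def train_with_halving(train_epoch, validate, lr=0.6, max_epochs=40,
                       validate_every=5, min_lr=1e-3):
    """Run training epochs, halving the learning rate when validation stalls.

    train_epoch(lr) performs one epoch; validate() returns a score where
    larger is better. Both are hypothetical callables supplied by the caller.
    """
    best = float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_epoch(lr)
        if epoch % validate_every == 0:
            score = validate()
            if score > best:
                best = score
            else:
                lr /= 2.0        # performance did not increase
            if lr < min_lr:
                break            # learning rate is sufficiently small
    return best, lr, epoch


# Toy usage with a fixed sequence of validation scores.
scores = iter([0.5, 0.6, 0.6, 0.55, 0.7, 0.7, 0.7, 0.7])
best, lr, epoch = train_with_halving(lambda lr: None, lambda: next(scores))
print(best, lr, epoch)
```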


5.1 Investigation on RPN Configurations 

This section describes the experiments comparing different configurations of RPN. All experiments 
were performed on both the DSTC-2 and DSTC-3 tasks. 

As indicated in section 4.4, an RPN can be initialized by a CMBP. Table 2 shows the performance
comparison between initialization with a CMBP and with random values. In this experiment,
the structure shown in figure]^ is used. The performance of initialization with random values (see
footnote 7) reported here is the average performance of 10 different random seeds, and their standard
deviations are given in parentheses.

5. MSE is chosen for two reasons: (i) MSE directly reflects L2 performance, which is one of the main
metrics in the DSTCs; (ii) experiments have shown that other criteria such as cross-entropy loss do
not lead to better performance.

6. The parameter of L2 regularization is set to the one leading to the best performance on dstc2dev
when trained on dstc2trn. L1 regularization is not used since it does not yield better performance.

7. The random scheme used is the one with the best performance on dstc2dev when trained on dstc2trn
among various kinds of random initialization schemes.


| Initialization | dstc2eval Acc | dstc2eval L2  | dstc3eval Acc | dstc3eval L2  |
|----------------|---------------|---------------|---------------|---------------|
| Random         | 0.753 (0.008) | 0.468 (0.020) | 0.633 (0.005) | 0.667 (0.026) |
| CMBP           | 0.756         | 0.373         | 0.648         | 0.553         |

Table 2: Performance comparison between the RPN initialized by random values, and the RPN
initialized by the CMBP coefficients on dstc2eval and dstc3eval.


The performance of the RPN initialized by random values is compared with the performance
of the RPN initialized by the integer-coefficient CMBP. Here, the CMBP has 12 non-zero coefficients
and has the best performance on dstc2dev when trained on dstc2trn. It can be seen from table 2 that
the RPN initialized by the CMBP coefficients outperforms the RPN initialized by random values
moderately on dstc2eval and significantly on dstc3eval (p-value < 0.05). This demonstrates that the
prior knowledge and intuition encoded in CMBP can be transferred to RPN to improve RPN's
performance, which is one of RPN's advantages in bridging rule-based models and statistical models.
In the rest of the experiments, all RPNs use CMBP coefficients for initialization.

Since section 4.3 shows that it is convenient to add features and try more complex structures, it
is interesting to investigate RPNs with different feature sets and structures, as shown in table 3.
It can be seen that, while no obvious correlation between the performance and the different
configurations of feature sets and structures can be observed on dstc2eval and dstc3eval, RPNs with
new features and new recurrent connections have achieved slightly better performance. Thus, in the
rest of the paper, both new features and new recurrent connections are used in RPN, unless otherwise
stated.


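| Feature Set    | New Recurrent Connections | dstc2eval Acc | dstc2eval L2 | dstc3eval Acc | dstc3eval L2 |
|----------------|---------------------------|---------------|--------------|---------------|--------------|
| $f_0 \sim f_5$ | No                        | 0.756         | 0.373        | 0.648         | 0.553        |
| $f_0 \sim f_9$ | No                        | 0.757         | 0.374        | 0.650         | 0.557        |
| $f_0 \sim f_5$ | Yes                       | 0.756         | 0.373        | 0.648         | 0.553        |
| $f_0 \sim f_9$ | Yes                       | 0.757         | 0.374        | 0.650         | 0.549        |

Table 3: Performance comparison among RPNs with different configurations on dstc2eval and
dstc3eval.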


5.2 Comparison with Other DST Approaches 

The previous subsection investigates how to get the RPN with the best configuration. In this
subsection, the performance of RPN is compared to both rule-based and statistical approaches. To
make a fair comparison, all statistical models together with RPN in this subsection use a similar
feature set. Altogether, 2 rule-based trackers and 3 statistical trackers were built for performance
comparison.

| Type        | System  | dstc2eval Acc | dstc2eval L2 | dstc3eval Acc | dstc3eval L2 |
|-------------|---------|---------------|--------------|---------------|--------------|
| Rule        | MaxConf | 0.668         | 0.647        | 0.548         | 0.861        |
|             | HWU     | 0.720         | 0.445        | 0.594         | 0.570        |
| Statistical | DNN     | 0.719         | 0.469        | 0.628         | 0.556        |
|             | MaxEnt  | 0.710         | 0.431        | 0.607         | 0.563        |
|             | LSTM    | 0.736         | 0.418        | 0.632         | 0.549        |
| Mixed       | CMBP    | 0.755         | 0.372        | 0.627         | 0.546        |
|             | RPN     | 0.756         | 0.373        | 0.648         | 0.553        |

Table 4: Performance comparison among RPN, rule-based and statistical approaches with similar
feature set on dstc2eval and dstc3eval. The performance of CMBP in the table is
the performance of the RPN which has been initialized but not been trained.


• MaxConf is a rule-based model commonly used in spoken dialogue systems which always
selects the value with the highest confidence score from the 1st turn to the current turn. It was
used as one of the primary baselines in DSTC-2 and DSTC-3.

• HWU is a rule-based model proposed by Wang and Lemon (2013). It is regarded as a simple
yet competitive baseline of DSTC-2 and DSTC-3.

• DNN is a statistical model using a deep neural network (Sun et al., 2014a) with the same
probability features as RPN. Since DNN does not have recurrent structures while RPN does, to
take this into account fairly, the DNN feature set at the $t$-th turn is defined as

$$\bigcup_{i \in \{\ldots, t\}} \{ p_i^+(v), p_i^-(v), \tilde{p}_i^+(v), \tilde{p}_i^-(v) \} \cup \{ P(t) \}$$

where $P(t)$ is the highest confidence score from the 1st turn to the $t$-th turn. The DNN has 3
hidden layers with 64 nodes per layer.

• MaxEnt is also a statistical model, using a Maximum Entropy model (Sun et al., 2014a) with
the same input features as DNN.

• LSTM is another statistical model, using a long short-term memory model (Hochreiter and
Schmidhuber, 1997) with the same input features as RPN. It has a similar structure to the DNN
model (Sun et al., 2014a), except that its hidden layers use LSTM blocks. The LSTM used
here has 3 hidden layers with 100 LSTM blocks per layer.
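
As an illustration of the simplest of these baselines, the MaxConf rule can be sketched in Python as follows. The input format (one dictionary of value to SLU confidence per turn, for a single slot) and the function name are assumptions made for this sketch; the second element of each output pair corresponds to the quantity $P(t)$ used in the DNN feature set.

```python
def maxconf(slu_turns):
    """For each turn, return the value with the highest SLU confidence so far.

    slu_turns is a list with one {value: confidence} dict per turn for a
    single slot (a hypothetical input format).
    """
    best_value, best_score, out = None, 0.0, []
    for hyps in slu_turns:
        for value, score in hyps.items():
            if score > best_score:
                best_value, best_score = value, score
        out.append((best_value, best_score))
    return out


turns = [{"thai": 0.4, "chinese": 0.3}, {"chinese": 0.7}, {"thai": 0.5}]
print(maxconf(turns))
```

Because the rule never revises a choice unless a strictly more confident hypothesis appears, it cannot accumulate weak evidence across turns, which is one reason it serves only as a baseline.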


It can be observed that, with a similar feature set, RPN can outperform both rule-based and
statistical approaches in terms of joint goal accuracy. Statistical significance tests were also
performed assuming a binomial distribution for each turn. RPN was shown to significantly outperform
both rule-based and statistical approaches at the 95% confidence level. For L2, RPN is competitive
with both rule-based and statistical approaches.
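
The paper does not spell out the exact test used. One common way to compare two accuracies under a per-turn binomial assumption is a two-proportion z-test, sketched below in Python. The test choice, the function name, the treatment of the two trackers as independent samples, and the turn count of 9000 in the usage line are all assumptions made for illustration and are not taken from the paper.

```python
import math


def two_proportion_z_test(acc_a, acc_b, n_turns):
    """Two-sided z-test for two accuracies, each measured on n_turns turns.

    Assumes each turn is an independent Bernoulli trial and that both
    trackers are scored on the same number of turns.
    """
    pooled = (acc_a + acc_b) / 2.0
    se = math.sqrt(2.0 * pooled * (1.0 - pooled) / n_turns)
    z = (acc_a - acc_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical turn count; accuracies taken from table 4 (RPN vs LSTM, dstc2eval).
z, p = two_proportion_z_test(0.756, 0.736, 9000)
print(z, p)
```

Since both trackers are evaluated on the same turns, a paired test would be a natural alternative to this independent-sample sketch.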

5.3 Comparison with State-of-the-art DSTC Trackers


In the DSTCs, the state-of-the-art trackers mostly employed statistical approaches. Usually, richer
feature sets and more complicated model structures than the statistical models in section 5.2 are
used. In this section, the proposed RPN approach is compared to the best submitted trackers in
DSTC-2/3 and the best CMBP trackers, regardless of fairness of feature selection and the SLU
refinement approach. The results are shown in table 5 and table 6. Note that the structure shown in
figure 7, with a richer feature set and a new recurrent connection, is used here.


| System                   | Approach   | Rank | Acc   | L2    |
|--------------------------|------------|------|-------|-------|
| Baseline*                | Rule       | 5    | 0.719 | 0.464 |
| Williams (2014)          | LambdaMART | 1    | 0.784 | 0.735 |
| Henderson et al. (2014d) | RNN        | 2    | 0.768 | 0.346 |
| Sun et al. (2014a)       | DNN        | 3    | 0.750 | 0.416 |
| Yu et al. (2015)         | Real CMBP  | 2.5  | 0.762 | 0.436 |
| RPN                      | RPN        | 2.5  | 0.757 | 0.374 |

Table 5: Performance comparison among RPN, real-coefficient CMBP and best trackers of DSTC-2
on dstc2eval. Baseline* is the best results from the 4 baselines in DSTC-2.


Note that, in DSTC-2, the system of Williams (2014) employed batch ASR hypothesis information
(i.e. off-line ASR re-decoded results) and cannot be used in the normal on-line mode in
practice. Hence, the practically best tracker is that of Henderson et al. (2014d). It can be observed
from table 5 that RPN ranks second only to the best practical tracker among the submitted trackers in
DSTC-2 in accuracy and L2. Considering that RPN only used probabilistic features and very limited
added features, and can operate very efficiently, it is quite competitive.


| System                   | Approach  | Rank | Acc   | L2    |
|--------------------------|-----------|------|-------|-------|
| Baseline*                | Rule      | 6    | 0.575 | 0.691 |
| Henderson et al. (2014c) | RNN       | 1    | 0.646 | 0.538 |
| Kadlec et al. (2014)     | Rule      | 2    | 0.630 | 0.627 |
| Sun et al. (2014b)       | Int CMBP  | 3    | 0.610 | 0.556 |
| Yu et al. (2015)         | Real CMBP | 1.5  | 0.634 | 0.579 |
| RPN                      | RPN       | 0.5  | 0.650 | 0.549 |

Table 6: Performance comparison among RPN, real-coefficient CMBP and best trackers of DSTC-3
on dstc3eval. Baseline* is the best results from the 4 baselines in DSTC-3.


It can be seen from table 6 that RPN trained on DSTC-2 can achieve state-of-the-art performance on
DSTC-3 without modifying the tracking method (see footnote 8), outperforming all the submitted
trackers in DSTC-3 including the RNN system. This demonstrates that RPN successfully inherits the
good generalization ability of rule-based models. Considering that the feature set and structure of
RPN are relatively simple in this paper, future work will investigate richer features and more
complex structures.

8. The parser is refined for DSTC-3 (Zhu et al., 2014).

6. Conclusion

This paper proposes a novel hybrid framework, referred to as recurrent polynomial network, to
bridge rule-based and statistical approaches. With the ability to incorporate prior
knowledge into a statistical framework, RPN has the advantages of both rule-based and statistical
approaches. Experiments on two DSTC tasks showed that the proposed approach is not only more
stable than many major statistical approaches, but also has competitive performance, outperforming
many state-of-the-art trackers.

Since the RPN in this paper only used probabilistic features and very limited added features, the
performance of RPN can be influenced by how reliable the SLU's confidence scores are. Therefore,
future work will investigate the influence of SLUs on the performance of RPN, as well as richer
features for RPN. Moreover, future work will also address applying RPN to other domains, such as the
bus timetables domain in DSTC-1, and the theoretical analysis of RPN.
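
Appendix

Derivative calculation

Using MSE as the criterion, $\delta_x^{(t)}$ is initialized as follows (see footnote 9):

$$\delta_x^{(t)} = \begin{cases} 2 \, (u_x^{(t)} - l_t) & \text{if } x \text{ is an output node} \\ 0 & \text{otherwise} \end{cases} \quad (27)$$

9. The symbols used in this section, such as $I$, $\hat{I}$, $M$, $\hat{M}$, follow the definitions
in section 3.

Suppose node $x$ is an activation node with $f(\cdot) = \mathrm{softclip}(\cdot)$, and let $y$ be
its input node, i.e. $I_x = \{y\}$. Then

$$\delta_y^{(t)} = \frac{\partial C}{\partial u_y^{(t)}} = \frac{\partial C}{\partial u_x^{(t)}} \frac{\partial u_x^{(t)}}{\partial u_y^{(t)}} = \delta_x^{(t)} f'(u_y^{(t)}) \quad (28)$$

Suppose node $x = (d, i)$ is a sum node. When node $x$ passes its error, the error of node
$y \in I_x$ is updated as

$$\delta_y^{(t)} = \delta_y^{(t)} + \delta_x^{(t)} \frac{\partial u_x^{(t)}}{\partial u_y^{(t)}} = \delta_y^{(t)} + \delta_x^{(t)} w_{x,y} \quad (29)$$

Similarly, the error of node $y \in \hat{I}_x$ is updated as

$$\delta_y^{(t-1)} = \delta_y^{(t-1)} + \delta_x^{(t)} \frac{\partial u_x^{(t)}}{\partial u_y^{(t-1)}} = \delta_y^{(t-1)} + \delta_x^{(t)} \hat{w}_{x,y} \quad (30)$$

Suppose node $x = (d, i)$ is a product node. When node $x$ passes its error, the error of node
$y \in I_x$ is updated as

$$\delta_y^{(t)} = \delta_y^{(t)} + \delta_x^{(t)} \frac{\partial u_x^{(t)}}{\partial u_y^{(t)}} = \delta_y^{(t)} + \delta_x^{(t)} M_{x,y} \left( u_y^{(t)} \right)^{M_{x,y} - 1} \prod_{z \in I_x - \{y\}} \left( u_z^{(t)} \right)^{M_{x,z}} \prod_{z \in \hat{I}_x} \left( u_z^{(t-1)} \right)^{\hat{M}_{x,z}} \quad (31)$$

Similarly, the error of node $y \in \hat{I}_x$ is updated as

$$\delta_y^{(t-1)} = \delta_y^{(t-1)} + \delta_x^{(t)} \frac{\partial u_x^{(t)}}{\partial u_y^{(t-1)}} = \delta_y^{(t-1)} + \delta_x^{(t)} \hat{M}_{x,y} \left( u_y^{(t-1)} \right)^{\hat{M}_{x,y} - 1} \prod_{z \in I_x} \left( u_z^{(t)} \right)^{M_{x,z}} \prod_{z \in \hat{I}_x - \{y\}} \left( u_z^{(t-1)} \right)^{\hat{M}_{x,z}} \quad (32)$$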